Improve PDF reader and SaveAsPdf compatibility #1755

Merged
PrzemyslawKlys merged 14 commits into EvotecIT:master from PrzemyslawKlys:codex/pdf-review-worktree
Apr 9, 2026

Conversation

@PrzemyslawKlys
Member

Summary

  • improve PDF writer and Word-to-PDF robustness around footer rendering, output path validation, QuestPDF license restoration, and custom font registration retries
  • expand PDF reader and lightweight extractor compatibility for inherited page metadata/resources, nested forms, content arrays, compressed and filtered streams, inline dictionaries, escaped names, comments, string decoding, and predictor-based DecodeParms
  • add broad regression coverage for PDF reader/extractor and SaveAsPdf edge cases
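
As an illustration of what "predictor-based DecodeParms" involves, here is a minimal sketch of reversing PNG row filters after Flate decompression. It is not the PR's `PngPredictorDecoder`; the class name, method name, and parameters (`columns` as bytes per row, `bpp` as bytes per pixel) are assumptions made for this example.

```csharp
using System;

public static class PngPredictorSketch {
    // Reverses PNG row filters (types 0-4). columns = bytes per row, bpp = bytes per pixel.
    public static byte[] Decode(byte[] data, int columns, int bpp = 1) {
        int stride = columns + 1;                 // every row starts with a filter-type byte
        int rows = data.Length / stride;          // a trailing partial row is ignored
        var output = new byte[rows * columns];
        for (int row = 0; row < rows; row++) {
            int filter = data[row * stride];
            int src = row * stride + 1, dst = row * columns, prev = dst - columns;
            for (int i = 0; i < columns; i++) {
                int left = i >= bpp ? output[dst + i - bpp] : 0;
                int up = row > 0 ? output[prev + i] : 0;
                int upLeft = row > 0 && i >= bpp ? output[prev + i - bpp] : 0;
                int predicted = filter switch {
                    0 => 0,                        // None
                    1 => left,                     // Sub
                    2 => up,                       // Up
                    3 => (left + up) / 2,          // Average
                    4 => Paeth(left, up, upLeft),  // Paeth
                    _ => throw new InvalidOperationException($"Unknown PNG filter type {filter}.")
                };
                // Byte arithmetic wraps modulo 256.
                output[dst + i] = unchecked((byte)(data[src + i] + predicted));
            }
        }
        return output;
    }

    private static int Paeth(int a, int b, int c) {
        int p = a + b - c;
        int pa = Math.Abs(p - a), pb = Math.Abs(p - b), pc = Math.Abs(p - c);
        return pa <= pb && pa <= pc ? a : pb <= pc ? b : c;
    }
}
```

For example, two rows of three bytes that both use the Up filter, `{ 2, 1, 2, 3,  2, 1, 1, 1 }`, decode to `{ 1, 2, 3, 2, 3, 4 }`.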

Root cause

The PDF projects handled the library's own output well, but several external-PDF shapes and a few SaveAsPdf failure paths still relied on narrow assumptions around page inheritance, stream filters, string parsing, and cleanup behavior.

Validation

  • dotnet test OfficeIMO.Tests/OfficeIMO.Tests.csproj --filter "FullyQualifiedName~PdfReaderAndFooterRegressionTests"
  • dotnet test OfficeIMO.Tests/OfficeIMO.Tests.csproj --filter "FullyQualifiedName~Pdf"

Regression results were green on net8.0, net10.0, and net472.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c4c7387578

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +73 to +76
if (decodeParmsObj is PdfArray decodeParmsArray &&
    filterIndex >= 0 &&
    filterIndex < decodeParmsArray.Items.Count &&
    ResolveDictionary(decodeParmsArray.Items[filterIndex], objects) is PdfDictionary indexedDict) {

P2: Resolve DecodeParms refs before indexing filter params

GetDecodeParms only handles DecodeParms when it is a direct dictionary or a direct array. If a PDF uses an indirect array (for example /DecodeParms 12 0 R where object 12 is [null << /Predictor ... >>]), this branch is skipped, so predictor settings are never applied and Flate-decoded content remains PNG-predicted garbage. That causes text extraction to fail on otherwise valid filtered streams that store decode params indirectly.
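
A minimal sketch of the suggested fix: resolve the `/DecodeParms` value before testing whether it is a dictionary or an array. The types `PdfRef`, `PdfDict`, and `PdfArr` and the `Resolve` helper are simplified stand-ins invented for this example, not the OfficeIMO.Pdf types.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical, simplified stand-ins for the reader's object model.
public sealed record PdfRef(int ObjectNumber);
public sealed class PdfDict : Dictionary<string, object?> { }
public sealed class PdfArr : List<object?> { }

public static class DecodeParmsSketch {
    // Follows indirect references until a direct object is reached (bounded against reference loops).
    public static object? Resolve(object? value, IReadOnlyDictionary<int, object?> objects) {
        int hops = 0;
        while (value is PdfRef r) {
            if (++hops > 32 || !objects.TryGetValue(r.ObjectNumber, out value)) return null;
        }
        return value;
    }

    // Returns the parameter dictionary for the filter at filterIndex, or null when none applies.
    public static PdfDict? GetDecodeParms(object? decodeParms, int filterIndex,
        IReadOnlyDictionary<int, object?> objects) {
        object? resolved = Resolve(decodeParms, objects);   // resolve before the type checks
        if (resolved is PdfDict single) {
            // Assumption: a lone dictionary belongs to the first (only) filter.
            return filterIndex == 0 ? single : null;
        }
        if (resolved is PdfArr array && filterIndex >= 0 && filterIndex < array.Count) {
            // Array entries may themselves be references or null.
            return Resolve(array[filterIndex], objects) as PdfDict;
        }
        return null;
    }
}
```

With object 12 defined as `[null << /Predictor 12 >>]`, calling `GetDecodeParms(new PdfRef(12), 1, objects)` in this sketch returns the predictor dictionary instead of skipping it.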

@codecov

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 60.81193% with 473 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.94%. Comparing base (e6b0074) to head (c4c7387).
⚠️ Report is 17 commits behind head on master.

Files with missing lines | Patch % | Lines
OfficeIMO.Pdf/Reading/PdfTextExtractor.cs | 52.07% | 208 Missing and 57 partials ⚠️
...ficeIMO.Pdf/Reading/Filters/PngPredictorDecoder.cs | 40.74% | 24 Missing and 8 partials ⚠️
OfficeIMO.Pdf/Reading/Core/TextContentParser.cs | 73.27% | 17 Missing and 14 partials ⚠️
OfficeIMO.Pdf/Reading/Core/PdfReadPage.cs | 80.14% | 17 Missing and 11 partials ⚠️
OfficeIMO.Pdf/Reading/Core/PdfSyntax.cs | 68.53% | 19 Missing and 9 partials ⚠️
OfficeIMO.Pdf/Reading/Core/PdfReadDocument.cs | 41.02% | 12 Missing and 11 partials ⚠️
OfficeIMO.Pdf/Reading/Filters/StreamDecoder.cs | 75.43% | 5 Missing and 9 partials ⚠️
OfficeIMO.Pdf/Reading/Filters/Ascii85Decoder.cs | 66.66% | 9 Missing and 4 partials ⚠️
OfficeIMO.Pdf/Reading/Filters/RunLengthDecoder.cs | 50.00% | 7 Missing and 4 partials ⚠️
OfficeIMO.Pdf/Reading/Filters/AsciiHexDecoder.cs | 59.09% | 4 Missing and 5 partials ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1755      +/-   ##
==========================================
+ Coverage   69.68%   69.94%   +0.25%     
==========================================
  Files         983      993      +10     
  Lines      115229   117591    +2362     
  Branches    23598    24288     +690     
==========================================
+ Hits        80300    82246    +1946     
- Misses      24839    24962     +123     
- Partials    10090    10383     +293     

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 88fdde728a

Comment on lines +956 to +960
foreach (var kid in kids.Items) {
    var child = ResolveDict(kid, objects);
    if (child is not null) {
        TraversePagesNode(child, objects, result, visited);
    }

P1: Guard page-tree traversal against cyclic Kids references

TraversePagesNode recursively descends through every /Kids entry but the visited set is only used for leaf /Page objects, not intermediate /Pages nodes. If a malformed or malicious PDF has a /Kids cycle (child points back to an ancestor), ExtractAllText will recurse indefinitely and can stack-overflow the process. Add a visited check (or depth guard) for each traversed node before recursing into children.
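
A minimal sketch of such a guard, combining a visited set for every node with a depth cap. `PageNode`, `CollectPages`, and the `maxDepth` default are hypothetical names and values for this example; the real reader traverses parsed PDF dictionaries.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical page-tree node used only for this illustration.
public sealed class PageNode {
    public bool IsLeafPage;
    public List<PageNode> Kids { get; } = new();
}

public static class PageTreeSketch {
    public static List<PageNode> CollectPages(PageNode root, int maxDepth = 64) {
        var pages = new List<PageNode>();
        // PageNode does not override Equals, so the set compares by reference.
        var visited = new HashSet<PageNode>();
        Traverse(root, 0);
        return pages;

        void Traverse(PageNode node, int depth) {
            // Mark every node, leaf or intermediate, before descending; cap depth as a second guard.
            if (depth > maxDepth || !visited.Add(node)) return;
            if (node.IsLeafPage) { pages.Add(node); return; }
            foreach (var kid in node.Kids) Traverse(kid, depth + 1);
        }
    }
}
```

In this sketch a child whose /Kids points back to an ancestor is simply skipped on the second visit, so traversal terminates and each leaf page is collected once.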

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f93de88543

return idx >= 0 ? idx + 6 : -1;
int searchFrom = start;
while (searchFrom >= 0 && searchFrom < text.Length) {
    int streamIdx = IndexOfKeyword(text, "stream", searchFrom, text.Length);

P1: Limit stream detection to real stream tokens

FindObjectEnd now treats the first occurrence of the substring stream as a stream start even inside ordinary object contents (for example, a metadata string like (mainstream)), then skips ahead to the next endstream in later objects. That can make object boundaries jump across objects and cause dictionaries to be misparsed as streams, corrupting the object map and breaking metadata/page parsing on valid PDFs containing the word "stream" in non-stream objects.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a30e38a337

}

char c = text[idx];
return char.IsWhiteSpace(c) || c is '/' or '<' or '>' or '[' or ']' or '(' or ')' or '%';

P1: Exclude name/string delimiters from stream keyword boundaries

FindObjectEnd relies on IndexOfKeyword("stream", ...), but HasKeywordBoundary currently treats /, (, and ) as valid token boundaries, so text like /stream (name value) or (stream) (literal string) inside a non-stream object is still recognized as a real stream marker. In that case the parser jumps ahead looking for endstream, which can shift object boundaries across later objects and corrupt the parsed object map for otherwise valid PDFs containing the standalone word stream in metadata or names.
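
One possible stricter check, sketched below, accepts `stream` only when it follows whitespace or the `>` that closes the stream dictionary and is immediately followed by LF or CR LF, which is what the PDF specification requires after the keyword. The class and method names are hypothetical. A literal string that spans lines could still fool this heuristic, so a real tokenizer that skips strings and comments would be more robust.

```csharp
using System;

public static class StreamKeywordSketch {
    // Returns the index of the first plausible "stream" keyword at or after start, or -1.
    public static int IndexOfStreamKeyword(string text, int start) {
        const string keyword = "stream";
        for (int i = text.IndexOf(keyword, start, StringComparison.Ordinal);
             i >= 0;
             i = text.IndexOf(keyword, i + 1, StringComparison.Ordinal)) {
            if (IsBoundaryBefore(text, i) && IsEolAfter(text, i + keyword.Length)) return i;
        }
        return -1;
    }

    // Before the keyword: start of text, whitespace, or '>' closing the dictionary.
    // '/', '(' and ')' are deliberately not accepted, so /stream and (stream) are ignored.
    private static bool IsBoundaryBefore(string text, int idx) =>
        idx == 0 || char.IsWhiteSpace(text[idx - 1]) || text[idx - 1] == '>';

    // After the keyword: LF, or CR immediately followed by LF.
    private static bool IsEolAfter(string text, int idx) =>
        idx < text.Length && (text[idx] == '\n' ||
            (text[idx] == '\r' && idx + 1 < text.Length && text[idx + 1] == '\n'));
}
```

Under these rules `<< /Title (mainstream) /Name /stream >>` yields no match, while `<< /Length 3 >>` followed by a newline and `stream` matches at the keyword. The `stream` inside `endstream` is rejected because it is preceded by `d`.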

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a8123a966

}
return $"ref:{r.ObjectNumber}";
}
if (c is PdfStream) return "stream:direct";

P1: Preserve unique keys for direct page content streams

ContentsKey now returns the constant "stream:direct" for every page whose /Contents is a direct stream, but TraversePagesNodeDeepLimited uses this key for de-duplication via contentKeys.Add(key). That means valid PDFs with multiple pages using direct (non-referenced) content streams will keep only the first page and silently drop the rest from Pages. Use a per-page unique key (or disable content-key dedupe for direct streams) so distinct pages are not collapsed.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7d9765d51d

Comment on lines +232 to +234
if (_objects.TryGetValue(r.ObjectNumber, out var ind) && ind.Value is PdfArray referencedArray) {
    return ContentsArrayKey(referencedArray);
}

P1: Preserve referenced content-array identity in page dedupe

When /Contents is an indirect reference to a PdfArray, this code now converts the key from ref:<objectNumber> to ContentsArrayKey(...), and TraversePagesNodeDeepLimited uses that key to drop “duplicates.” In valid PDFs, different page objects can point to different content-array objects that contain the same stream references (or reordered wrappers), and this change causes later pages to be silently skipped, reducing Pages and extracted text. Keep the indirect object identity in the dedupe key (or stop content-based dedupe) so distinct pages are not collapsed.
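
A minimal sketch of identity-based dedupe, which also covers pages whose /Contents is a direct stream: indirect values are keyed by object number and direct values by instance, never by what they contain. `IndirectRef`, `DirectStream`, and `PageDeduper` are hypothetical names, and `ReferenceEqualityComparer` assumes .NET 5 or later, so a net472 build would need its own reference comparer.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical stand-ins for parsed /Contents values.
public sealed record IndirectRef(int ObjectNumber);
public sealed class DirectStream { public byte[] Data = System.Array.Empty<byte>(); }

public sealed class PageDeduper {
    private readonly HashSet<int> _seenRefs = new();
    private readonly HashSet<object> _seenDirect = new(ReferenceEqualityComparer.Instance);

    // Returns true the first time a given /Contents object is seen.
    public bool TryAdd(object? contents) => contents switch {
        null => true,                                    // pages without contents are never collapsed
        IndirectRef r => _seenRefs.Add(r.ObjectNumber),  // identity of the referenced object, array or stream
        _ => _seenDirect.Add(contents)                   // identity of the direct instance
    };
}
```

In this sketch two pages with separate direct streams, or with references to different array objects that happen to list the same streams, are both kept. Only a repeated reference to the same object number is treated as a duplicate.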

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2fc7fa534f

Comment on lines +80 to +82
bool trackRecursion = formObjectNumber > 0;
if (trackRecursion && !activeForms.Add(formObjectNumber)) {
    continue;

P1: Guard direct form XObjects against recursive Do cycles

The recursion guard only tracks forms when formObjectNumber > 0, so direct /XObject form streams (returned with objectNumber = 0) are never added to activeForms. A PDF that uses direct form dictionaries with self-reference or a direct-form cycle will recurse indefinitely in CollectTextAndForms, eventually stack-overflowing during GetTextSpans(). Please key recursion tracking for direct streams too (for example by object identity or resource name path), not only indirect object numbers.

Comment on lines +501 to +503
bool trackRecursion = formObjectNumber > 0;
if (trackRecursion && !activeForms.Add(formObjectNumber)) {
    return string.Empty;

P1: Prevent infinite recursion for direct forms in text extractor

The lightweight extractor has the same guard gap: it only adds forms to activeForms when an indirect object number exists, but TryGetFormStream sets direct form streams to objectNumber = 0. If a page/form resources tree contains direct form XObjects that invoke themselves (or each other), ExtractTextFromContentStream will recurse without termination and can crash with stack overflow. Track direct-form recursion as well, not just indirect references.
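
A minimal sketch of a guard that covers both kinds of form: indirect forms are keyed by object number and direct forms by the stream instance itself. `FormXObject`, `InvokedForms`, and `ExtractText` are hypothetical simplifications of the behavior described above, not the extractor's real API.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text;

// Hypothetical form XObject: ObjectNumber is 0 for direct streams, as described in the review.
public sealed class FormXObject {
    public int ObjectNumber;
    public string Text = "";
    public List<FormXObject> InvokedForms { get; } = new();  // forms painted via the Do operator
}

public static class FormRecursionSketch {
    public static string ExtractText(FormXObject root) {
        var sb = new StringBuilder();
        Visit(root, new HashSet<object>(), sb);
        return sb.ToString();
    }

    private static void Visit(FormXObject form, HashSet<object> active, StringBuilder sb) {
        // Boxed ints compare by value; FormXObject instances compare by reference.
        object key = form.ObjectNumber > 0 ? (object)form.ObjectNumber : form;
        if (!active.Add(key)) return;          // already on the current call path: a cycle
        try {
            sb.Append(form.Text);
            foreach (var child in form.InvokedForms) Visit(child, active, sb);
        } finally {
            active.Remove(key);                // the same form may still appear on another path
        }
    }
}
```

Removing the key in `finally` keeps the set limited to the active call path, so a form that is legitimately painted twice by the same parent is still extracted twice, while a self-invoking direct form is visited only once.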

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@PrzemyslawKlys PrzemyslawKlys merged commit d47a303 into EvotecIT:master Apr 9, 2026
9 checks passed
@PrzemyslawKlys PrzemyslawKlys deleted the codex/pdf-review-worktree branch April 9, 2026 06:10